R Markdown testing

So far, so good

I’m participating in this open data science course and I’m now supposed to start my own project, work on it in RStudio, communicate it by using R Markdown and share it via GitHub. Let’s see how frightened I will be of this “new technology”.

First I was asked to provide you with this

Check out my GitHub!!

Then to business

The following has nothing to do with the actual course work. I’m just testing out stuff.

I first made the mistake of placing the data file that I wanted to read in into a separate folder inside the IODS-project folder. When I then tried to read in the file, my code wasn’t able to find it. After some googling I found out that this was because the R console and R Markdown use separate, independent working directories. You can read about this issue here.
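In case someone else hits the same problem, one workaround I’ve seen suggested (I haven’t set it up here, so treat this chunk as a sketch) is to tell knitr explicitly which directory to knit in, in a setup chunk:

```r
# Hypothetical setup chunk: point knitr at the project root so that
# relative paths resolve the same way when knitting as in the console.
knitr::opts_knit$set(root.dir = "C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project")
```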

So I start off by checking that I’m in the right working directory:

getwd()
## [1] "C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project"

Everything’s in order. So now let’s read in the data that consists of my money expenditure during the past year.

money <- read.csv("Sirkesjam.csv", header = TRUE)
#let's see the first few rows
head(money)
##         date amount
## 1 2018-10-26  10.40
## 2 2018-10-25   8.84
## 3 2018-10-25   7.78
## 4 2018-10-24   0.20
## 5 2018-10-23  60.00
## 6 2018-10-23   3.50

Let’s plot the data to see whether there are any nice (or worrying) patterns to see.

require(ggplot2)
ggplot(data = money, aes(date, amount)) + geom_line()

For some reason the plot looks different in R compared to the knitted outcome shown above. I will get back to this later..
Here’s how the plot looks in R:

My idea is to study whether I spend more money on weekends or during holiday seasons. I could imagine that this is the case BUT normally during weekends and holidays I go out hiking somewhere in the forests, where there are fewer opportunities to spend money, so I might be wrong.
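My first suspect for the plot difference (just a guess at this point, illustrated with a few stand-in rows since the real file isn’t shown here): read.csv() leaves the date column as text, so ggplot treats the x axis as discrete labels rather than a time axis.

```r
# Stand-in rows mimicking Sirkesjam.csv (the real file is read above).
money <- data.frame(date   = c("2018-10-23", "2018-10-24", "2018-10-25"),
                    amount = c(60.00, 0.20, 8.84),
                    stringsAsFactors = FALSE)

# read.csv() typically imports 'date' as character (or factor), which
# makes ggplot order the x axis as discrete labels; converting to Date
# gives a continuous time axis.
money$date <- as.Date(money$date)
class(money$date)    # "Date"
```

After this conversion the same `ggplot(money, aes(date, amount)) + geom_line()` call should draw a proper time series.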


Linear regression analysis

What I learned this week

This week I’ve learned about data wrangling and linear regression analysis. Efficient data wrangling is the basis of all data science: one has to be able to modify, filter and convert the raw data into an appropriate format before any statistical analysis can be performed on it.

I started my “statistical journey” with linear regression. This analysis finds out whether there are statistically significant relationships between the dependent and explanatory variables. Depending on the number of explanatory variables, the method is called either simple linear regression or multiple linear regression. I fitted a multiple linear regression model, interpreted the results and checked the validity of my model.

Here’s how I did it..

Data exploration

Let’s start by reading in the data and exploring the structure and dimensions of the data:

#read in the data with read.csv(); header = T (TRUE) means the first row contains column names
learning2014 <- read.csv("C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project/data/learning2014.csv", header = T)
#str() lists all the variables in the data, their class and the first few observations
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
#dim() gives the number of rows and columns in the data
dim(learning2014)
## [1] 166   7

The data comes from a questionnaire survey that measured students’ learning while studying statistics. The data consists of 166 observations (individuals) and 7 measured variables. Here’s a list explaining the variables:

Age = Age (in years) derived from the date of birth

Attitude = Global attitude toward statistics

Points = Exam points

gender = Gender: M (Male), F (Female)

Deep = Deep approach to learning (on a scale from 1 to 5)

Surf = Surface approach to learning (on a scale from 1 to 5)

Stra = Strategic approach to learning (on a scale from 1 to 5)

You can read more about the study variables here.

Let’s explore the variation in the variables:

#let's take a summary of all the variables
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

We can see that most of the students who participated in the study are female, and the typical participant is in their early twenties (median age 22, mean 25.5). The three learning approach variables (deep, stra and surf) and the exam points all have fairly even distributions.

Then let’s visualize how different variables affect the points that students are scoring:

# access the GGally and ggplot2 libraries
library(GGally)
## Warning: package 'GGally' was built under R version 3.4.4
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(ggplot2)

# create a plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

-Looks like students with a higher (better) attitude score more points. There may be a slight gender division, but the main conclusion is the same for both genders.

-Deep learning approach doesn’t seem to influence points scored.

-Strategic learning approach seems to have a slight positive effect on points scored.

-Surface learning approach seems to have a slight negative effect on points scored.

-Older students seem to score fewer points.

In summary, it looks like attitude and the strategic and surface learning approaches might affect points most strongly. But to be sure, we want to test this statistically.

Fitting a statistical model

Let’s fit a linear model where points are explained by attitude, stra and surf:

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra+surf, data = learning2014)

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

From the summary of the statistical test we find that the variables stra and surf do not have a statistically significant influence on points, because their p-values (Pr(>|t|)) are greater than 0.05. This means that the points students get from the exam do not depend on their strategic or surface learning approach.

Attitude, however, has a strong, statistically significant influence on points: its p-value is far below 0.05.

Model selection

Let’s explore which variables we can drop as unnecessary. Probably surf but can we also drop stra?

In the following I use AIC values for model selection. You can read more about them here.
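For intuition, here is the AIC formula checked by hand on stand-in data (mtcars, not the course data): AIC = 2k − 2·log-likelihood, where k is the number of estimated parameters, and lower is better.

```r
# AIC = 2k - 2*logLik; for an lm fit, k counts the coefficients plus
# the residual variance. mtcars is only a stand-in dataset here.
fit <- lm(mpg ~ wt + hp, data = mtcars)
k <- length(coef(fit)) + 1            # +1 for the error variance
2 * k - 2 * as.numeric(logLik(fit))   # the same number AIC(fit) reports
```

Note that step() prints extractAIC() values, which differ from AIC() by a constant, so only the differences between models matter, not the absolute numbers.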

# step() drops each variable one at a time and computes an AIC value; the best model has the lowest AIC.
step(my_model2)
## Start:  AIC=557.4
## points ~ attitude + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - surf      1     15.00 4559.4 555.95
## <none>                  4544.4 557.40
## - stra      1     69.61 4614.0 557.93
## - attitude  1    980.95 5525.3 587.85
## 
## Step:  AIC=555.95
## points ~ attitude + stra
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  4559.4 555.95
## - stra      1     81.74 4641.1 556.90
## - attitude  1   1051.89 5611.3 588.41
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude         stra  
##      8.9729       3.4658       0.9137
#drop1() does much the same thing: it tries deleting each single term from the full model and reports the resulting AIC for each deletion
drop1(my_model2)
## Single term deletions
## 
## Model:
## points ~ attitude + stra + surf
##          Df Sum of Sq    RSS    AIC
## <none>                4544.4 557.40
## attitude  1    980.95 5525.3 587.85
## stra      1     69.61 4614.0 557.93
## surf      1     15.00 4559.4 555.95

These analyses suggest that we should drop ‘surf’ from the model as a non-influential variable but keep ‘stra’ still in the game. However, because the AIC difference to the next-best model is less than 2, it is usually recommended to choose the simpler model.

Therefore, let’s fit a new model with only attitude:

#new model with only attitude as the explanatory variable
my_model3 <- lm(points ~ attitude, data = learning2014)

#summary of the new model
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

From the summary of the final model we first find the distribution of the residuals, which describe the differences between the observed values and the values fitted by the model.

Next there is a coefficient table. Regression coefficients represent the mean change in the response variable (points) for one unit of change in a predictor variable (here attitude) while holding the other predictors in the model constant.

The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (no effect).

From these results I deduce that attitude has a strong positive effect on points.

R-squared is a statistical measure of how close the data are to the fitted regression line. It is the percentage of the response variable variation that is explained by a linear model.

In our case the R-squared value is around 20 % which seems low but in some fields, it is entirely expected that R-squared values will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than 50%. Humans are simply harder to predict than physical processes.
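The R-squared figure can also be verified by hand; here is the calculation on stand-in data (mtcars, not the course data):

```r
# R-squared by hand: 1 - RSS/TSS, i.e. the share of the response's
# variance that the model explains.
fit <- lm(mpg ~ wt, data = mtcars)
rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
1 - rss / tss                                   # equals summary(fit)$r.squared
```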

The F-statistic on the last line tells whether the regression as a whole performs ‘better than random’: any set of random predictors will show some relationship with the response, so the test checks whether my model fits better than would be expected if none of my predictors truly had a relationship with the response. This p-value is below 0.05, so my model makes sense.
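As a check on that last line, the reported p-value can be reproduced from the printed F statistic and its degrees of freedom alone:

```r
# Reproduce the model's overall p-value from the F statistic
# and its degrees of freedom (3 and 162):
p_val <- pf(14.13, df1 = 3, df2 = 162, lower.tail = FALSE)
p_val    # close to the reported 3.156e-08
```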

Model validation

Then it’s time for model validation. I made certain assumptions about the data when I decided to use a linear regression model in my analysis. Now I have to test that these assumptions hold and are not violated.

Let’s draw some diagnostic plots. These things are easier to interpret visually.

par(mfrow = c(2,2))
#diagnostics of the final model
plot(my_model3, which = c(1, 2, 5))

The first plot is a scatter plot of residuals on the y axis and fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers. If the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly.

In this case everything seems to be ok. There are a few cases (labelled with an id number) that have lower values, but they are not clear outliers in my opinion.

The second plot is basically a Q-Q plot: a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles come from the same distribution, the points should form a roughly straight line.

In this case I use a NORMAL Q-Q plot, which checks the assumption that the residuals are normally distributed. This assumption seems to hold.

The last plot helps me to find influential cases, if there are any. Not all outliers are influential in linear regression analysis. When cases lie outside Cook’s distance (marked by a dashed red line), they are influential to the regression results: excluding them would alter the results.

In this case there are no influential cases. The red dashed line, outside of which the influential cases would be found, is not even visible in this plot.

Conclusions

My model seems to be a valid one. It explains around 20 % of the variation in the points that students scored in the exam. The most important factor, affecting points positively, was attitude.

So check your attitude!!


Logistic regression

What I learned this week

This week I learned about logistic regression..

Data exploration and selection

Let’s start by reading in the data and exploring the structure and dimensions of the data:

#read in the csv
alc<-read.csv("C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project/data/alc.csv",header = T)

#print the names of the columns
names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This data consists of student achievement records in secondary education from two Portuguese schools. The data attributes include student grades (G1, G2 and G3) and demographic, social and school-related features, collected by using school reports and questionnaires. Two datasets were provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). These datasets have been combined so that numerical variables are averaged and categorical values are taken directly from the Mathematics dataset. Alcohol consumption during weekdays and weekends has been combined as an average and categorized as high use (yes/no). Check the variable details here.

The data consists of 35 variables and 382 observations.

Next I want to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I choose 4 interesting variables and, for each of them, present a hypothesis about its relationship with alcohol consumption.

  1. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent) I assume students with very bad family relationships have high alcohol consumption.

  2. absences - number of school absences (numeric: from 0 to 93) I assume students with lots of absences have high alcohol consumption.

  3. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) I assume that students with low weekly study time have high alcohol consumption.

  4. higher - wants to take higher education (binary: yes or no) I assume students who don’t want to take higher education have high alcohol consumption.

#get some packages
library(tidyr); library(dplyr); library(ggplot2)

#select only the data that I am interested in
choose<-c("high_use","famrel","absences","studytime","higher")
alc2<-select(alc,one_of(choose))

#summary table of the data
summary(alc2)
##   high_use           famrel         absences      studytime     higher   
##  Mode :logical   Min.   :1.000   Min.   : 0.0   Min.   :1.000   no : 18  
##  FALSE:268       1st Qu.:4.000   1st Qu.: 1.0   1st Qu.:1.000   yes:364  
##  TRUE :114       Median :4.000   Median : 3.0   Median :2.000            
##                  Mean   :3.937   Mean   : 4.5   Mean   :2.037            
##                  3rd Qu.:5.000   3rd Qu.: 6.0   3rd Qu.:2.000            
##                  Max.   :5.000   Max.   :45.0   Max.   :4.000
#draw barplots of variables to study their distribution
gather(alc2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

None of the variables seem to be evenly distributed.

library(GGally)

# create a plot matrix with ggpairs()
p <- ggpairs(alc2, mapping = aes(col=high_use,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

-There are more students that don’t have a high use of alcohol than those that do.

-There are more students that want to take higher education than those who don’t want to.

-Many students have very few absences, some have a few, and then again many students have more absences.

-Family relations seem to be fairly good for most of the students.

-Most of the students study 2-5 hours weekly. Only a few students study a lot.

Looks like all the hypotheses that I made hold some truth, although I guess I might not have enough data from all observed categories to verify my assumptions.

Fitting a logistic regression model

#fit a logistic regression model; the -1 drops the intercept, which is
#why both levels of 'higher' get their own coefficient in the summary
m <- glm(high_use ~ famrel + absences + studytime + higher - 1, data = alc2, family = "binomial")

#print a summary table
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + absences + studytime + higher - 
##     1, family = "binomial", data = alc2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3321  -0.8215  -0.6473   1.1762   2.1735  
## 
## Coefficients:
##           Estimate Std. Error z value Pr(>|z|)    
## famrel    -0.23369    0.12639  -1.849 0.064466 .  
## absences   0.07753    0.02273   3.411 0.000647 ***
## studytime -0.52003    0.15754  -3.301 0.000963 ***
## higherno   1.13706    0.73724   1.542 0.122996    
## higheryes  0.67440    0.60457   1.115 0.264638    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 529.56  on 382  degrees of freedom
## Residual deviance: 429.40  on 377  degrees of freedom
## AIC: 439.4
## 
## Number of Fisher Scoring iterations: 4

From the summary table I can see that the relationship:

-between high use and famrel is not significant, but students with better (higher) family relations are less likely to be in the high use category.

-between absences and high use is highly significant: students with many absences are more likely to be in the high use category.

-between study time and high use is also highly significant: students who spend more time studying are less likely to be in the high use category.

-between higher education and high use is not significant, but students with no willingness to take higher education are more likely to be in the high use category.

Model selection

I study the factor variable ‘higher’ a bit more. To check whether ‘higher’ improves the model fit, I fit one model with it (my.mod1) and one without it (my.mod2) and conduct a likelihood ratio test. This tests the hypothesis that all coefficients of ‘higher’ are zero:

my.mod1 <- glm(high_use ~ famrel + absences + studytime + higher - 1, data = alc2, family = "binomial")
my.mod2 <- glm(high_use ~ famrel + absences + studytime - 1, data = alc2, family = "binomial")

anova(my.mod1, my.mod2, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ famrel + absences + studytime + higher - 1
## Model 2: high_use ~ famrel + absences + studytime - 1
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       377     429.40                     
## 2       379     431.77 -2  -2.3695   0.3058

This test tells me that having the variable ‘higher’ in the model doesn’t improve the fit, so I can drop it. I fit a new model without it:

#the new model without 'higher'
m2 <- glm(high_use ~ famrel + absences + studytime, data = alc2, family = "binomial")
#summary table of the new model
summary(m2)
## 
## Call:
## glm(formula = high_use ~ famrel + absences + studytime, family = "binomial", 
##     data = alc2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1549  -0.8286  -0.6531   1.1925   2.1845  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.75052    0.59833   1.254 0.209713    
## famrel      -0.23533    0.12630  -1.863 0.062431 .  
## absences     0.07773    0.02252   3.452 0.000557 ***
## studytime   -0.54411    0.15551  -3.499 0.000467 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 430.19  on 378  degrees of freedom
## AIC: 438.19
## 
## Number of Fisher Scoring iterations: 4

Next I’ll try dropping the variable ‘famrel’ because it doesn’t seem to be significant (its p-value is greater than 0.05). I fit a model without ‘famrel’.

m3 <- glm(high_use ~ absences + studytime, family = "binomial", data = alc2)

summary(m3)
## 
## Call:
## glm(formula = high_use ~ absences + studytime, family = "binomial", 
##     data = alc2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2128  -0.8387  -0.7046   1.1996   2.1832  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.16640    0.33954  -0.490 0.624087    
## absences     0.08054    0.02285   3.524 0.000425 ***
## studytime   -0.55015    0.15550  -3.538 0.000403 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 433.65  on 379  degrees of freedom
## AIC: 439.65
## 
## Number of Fisher Scoring iterations: 4

The model without ‘famrel’ has a higher AIC value than the model including it. When comparing AIC values one should choose the model with the lowest AIC. But if the difference between the best models is less than 2 units, it is advisable to choose the simpler one. Based on this advice I drop ‘famrel’, and my final model includes only absences and studytime as explanatory variables.

Next I express the coefficients of the final model as odds ratios and provide confidence intervals for them:

Counting the odds

#compute odds ratios
OR <- coef(m3) %>% exp

# compute confidence intervals (CI)
CI <- confint(m3) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.8467110 0.4349166 1.6508083
## absences    1.0838772 1.0386751 1.1361111
## studytime   0.5768622 0.4211716 0.7758799

Odds ratios higher than 1 mean that the variable is positively associated with “success”, in this case being in the high use category. Here, the variable ‘absences’ is positively associated with high use and ‘studytime’ is negatively associated.

From the confidence intervals I check that 1 does not fall inside them, because an interval containing 1 would mean the variable might have no effect on the outcome. I can also look at the width of each interval: a wide interval indicates a low precision of the odds ratio estimate, whereas a narrow interval indicates a higher precision.

In my case studytime has a rather wide confidence interval, so its odds ratio is estimated with lower precision.
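To make the link between a logistic coefficient and its odds ratio concrete, here is the conversion done by hand for the absences estimate printed in the model summary above:

```r
# A logistic regression coefficient b lives on the log-odds scale;
# exp(b) is the odds ratio: each one-unit increase in the predictor
# multiplies the odds of high_use by exp(b).
b_absences <- 0.08054     # estimate from summary(m3) above
exp(b_absences)           # about 1.084: each extra absence raises the odds by ~8 %
```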

Predictions

To validate my model I use it to make predictions and compare them with the true observed values.

# predict() the probability of high_use
probabilities <- predict(m3, type = "response")

# add the predicted probabilities to 'alc2'
alc2 <- mutate(alc2, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability>0.5)
## Warning: package 'bindrcpp' was built under R version 3.4.4
# see the last ten original classes, predicted probabilities, and class predictions
select(alc2, studytime, absences, high_use, probability, prediction) %>% tail(10)
##     studytime absences high_use probability prediction
## 373         1        0    FALSE   0.3281537      FALSE
## 374         1        7     TRUE   0.4618903      FALSE
## 375         3        1    FALSE   0.1497827      FALSE
## 376         1        6    FALSE   0.4419431      FALSE
## 377         3        2    FALSE   0.1603317      FALSE
## 378         2        2    FALSE   0.2486902      FALSE
## 379         2        2    FALSE   0.2486902      FALSE
## 380         2        3    FALSE   0.2640419      FALSE
## 381         1        4     TRUE   0.4026660      FALSE
## 382         1        2     TRUE   0.3645990      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   256   12
##    TRUE     96   18
# the same table as proportions, with margins
table(high_use = alc2$high_use, prediction = alc2$prediction)%>%prop.table()%>%addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67015707 0.03141361 0.70157068
##    TRUE  0.25130890 0.04712042 0.29842932
##    Sum   0.92146597 0.07853403 1.00000000

Above I produced two confusion tables (in absolute numbers and as proportions), from which I can see how many, or what proportion, of the predictions were correct.

Then the same thing as a plot:

g <- ggplot(alc2, aes(x = probability, y = high_use,col=prediction))

# define the geom as points and draw the plot
g+geom_point()

In this plot I can see both the actual values and the predictions. Quite a lot of cases are predicted FALSE even though they are actually TRUE. But that is not as bad as predicting TRUE for cases that are actually FALSE.

To quantify the goodness of my model I can compute the proportion of incorrect predictions.

# define a loss function (mean prediction error): abs(class - prob) > 0.5
# is TRUE exactly when the 0.5-threshold prediction disagrees with the
# true class
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions
lfm3<-loss_func(class = alc2$high_use, prob = alc2$probability)
lfm3
## [1] 0.2827225

The result is 28 %. I could also have read that off the second cross table (25 + 3 = 28 %). In my opinion that sounds like a lot, but I guess it is still better than a simple guessing strategy. 72 % correct predictions sounds better :)
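As a quick sanity check that the loss function does what its comment claims, here it is on a tiny hand-made example:

```r
# Toy check: abs(class - prob) > 0.5 flags a wrong prediction at the
# 0.5 threshold (TRUE counts as 1, FALSE as 0 in the subtraction).
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = c(TRUE, FALSE, TRUE, FALSE),
          prob  = c(0.9,  0.2,  0.3,  0.7))
# the first two cases are classified correctly, the last two are not -> 0.5
```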

The problem with this kind of model validation is that the model is asked to predict the same values that are used to create the model. This seems like a circular argument.

Cross-validation

Another way of doing model validation is cross-validation. Here a proportion of the data is set aside as testing data and the model is fitted on the rest (the training data) only. The resulting model is then used to make predictions on the testing data, and the goodness of the model is quantified as the proportion of correct predictions. This procedure is repeated so that the whole dataset gets used as testing data. In 5-fold cross-validation the data are split into 5 parts, each of which serves as the testing set in turn. Which data to select as testing data can be random or, for example, spatially defined.
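The procedure above can be written out by hand for a single fold; here is a sketch on simulated data (not the actual analysis, which uses cv.glm() below):

```r
# One fold of cross-validation, hand-rolled on simulated data:
# hold out 20 % of the rows, fit on the rest, score on the held-out part.
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.8 * d$x))           # simulated binary outcome
test_idx <- sample(200, 40)                        # 20 % held out
fit  <- glm(y ~ x, data = d[-test_idx, ], family = "binomial")
pred <- predict(fit, newdata = d[test_idx, ], type = "response") > 0.5
err  <- mean(pred != (d$y[test_idx] == 1))         # held-out error rate
err
```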

In R cross-validation is easy to run several times, so I repeat the following R code a couple of times:

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2931937

All the results that I get are close to 29-30 %. This means that under cross-validation my model seems to perform worse than with the loss function computed earlier (and than the model used in DataCamp). I assume this is because the cross-validation procedure is more rigorous and therefore gives a more realistic estimate of prediction accuracy. Without proper model validation the results might be overly optimistic.

Comparing complicated and simple models with cross-validation metrics

Next I do a little exploration. I compare the performance of logistic regression models with different sets of predictors, building models using 30, 19, 10 and 5 variables. I compute training and testing errors for all the models and compare them.

names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
#select 30 variables
chooseM1<-c("age","high_use","famrel","absences","studytime","higher","activities","nursery", "internet","romantic","G3","address","Pstatus","Medu","Fedu","Mjob","Fjob", "reason","guardian","traveltime","failures","schoolsup", "famsup","paid", "freetime", "goout","health","school","sex","famsize")

alcM1<-select(alc,one_of(chooseM1))

#29 explanatory variables (everything in chooseM1 except high_use)
M1<-glm(high_use ~ school+higher+sex+age+ address+ famsize+ Medu+ Fedu+ Mjob+ Fjob+ reason+ nursery+ internet+ guardian+ traveltime+ failures+ schoolsup+ famsup+ paid+ freetime+ goout+ health+absences+studytime+famrel+activities+romantic+G3+Pstatus, data = alcM1, family = "binomial")

summary(M1)
## 
## Call:
## glm(formula = high_use ~ school + higher + sex + age + address + 
##     famsize + Medu + Fedu + Mjob + Fjob + reason + nursery + 
##     internet + guardian + traveltime + failures + schoolsup + 
##     famsup + paid + freetime + goout + health + absences + studytime + 
##     famrel + activities + romantic + G3 + Pstatus, family = "binomial", 
##     data = alcM1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8741  -0.6724  -0.3610   0.4884   2.8798  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -5.15777    3.12129  -1.652 0.098443 .  
## schoolMS         -0.63603    0.54400  -1.169 0.242336    
## higheryes        -0.09098    0.69925  -0.130 0.896481    
## sexM              0.78325    0.32823   2.386 0.017019 *  
## age               0.11618    0.14915   0.779 0.435985    
## addressU         -0.79383    0.41437  -1.916 0.055398 .  
## famsizeLE3        0.50910    0.32421   1.570 0.116356    
## Medu              0.04330    0.22061   0.196 0.844378    
## Fedu              0.17994    0.19563   0.920 0.357689    
## Mjobhealth       -1.07131    0.76315  -1.404 0.160382    
## Mjobother        -0.88901    0.49991  -1.778 0.075350 .  
## Mjobservices     -0.87464    0.57337  -1.525 0.127151    
## Mjobteacher      -0.28326    0.69357  -0.408 0.682977    
## Fjobhealth        0.39818    1.19257   0.334 0.738465    
## Fjobother         0.84816    0.86476   0.981 0.326685    
## Fjobservices      1.42043    0.88485   1.605 0.108434    
## Fjobteacher      -0.44976    1.03350  -0.435 0.663434    
## reasonhome        0.46349    0.38041   1.218 0.223067    
## reasonother       1.49313    0.54221   2.754 0.005891 ** 
## reasonreputation -0.07508    0.41153  -0.182 0.855246    
## nurseryyes       -0.53587    0.36730  -1.459 0.144582    
## internetyes       0.05745    0.45599   0.126 0.899736    
## guardianmother   -0.80193    0.36635  -2.189 0.028601 *  
## guardianother    -0.20393    0.79701  -0.256 0.798051    
## traveltime        0.32510    0.23173   1.403 0.160644    
## failures          0.22823    0.26686   0.855 0.392408    
## schoolsupyes      0.07851    0.46427   0.169 0.865721    
## famsupyes        -0.08117    0.32407  -0.250 0.802224    
## paidyes           0.59908    0.31690   1.890 0.058699 .  
## freetime          0.30046    0.16728   1.796 0.072471 .  
## goout             0.86758    0.15351   5.651 1.59e-08 ***
## health            0.18614    0.11321   1.644 0.100130    
## absences          0.08983    0.02664   3.372 0.000747 ***
## studytime        -0.31530    0.20585  -1.532 0.125599    
## famrel           -0.56897    0.16717  -3.404 0.000665 ***
## activitiesyes    -0.57526    0.30855  -1.864 0.062265 .  
## romanticyes      -0.54641    0.33105  -1.651 0.098840 .  
## G3                0.03375    0.05190   0.650 0.515575    
## PstatusT         -0.36418    0.53795  -0.677 0.498421    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 322.06  on 343  degrees of freedom
## AIC: 400.06
## 
## Number of Fisher Scoring iterations: 5
# predict() the probability of high_use
probabilities <- predict(M1, type = "response")

# add the predicted probabilities to 'alc'
alcM1 <- mutate(alcM1, probability = probabilities)

# use the probabilities to make a prediction of high_use
alcM1 <- mutate(alcM1, prediction = probability>0.5)

# call loss_func to compute the average number of wrong predictions
lfM1<-loss_func(class = alcM1$high_use, prob = alcM1$probability)

#cross-validation
cv <- cv.glm(data = alcM1, cost = loss_func, glmfit = M1, K = 10)

# average number of wrong predictions in the cross validation
cvM1<-cv$delta[1]

#####################################################################
#19 variables
chooseM2<-c("high_use","famrel","absences","studytime","activities","nursery", "romantic","address", "reason","guardian","traveltime","failures","paid", "freetime", "goout","health","school","sex","famsize")

alcM2<-select(alc,one_of(chooseM2))

#18 explanatory variables
M2<-glm(high_use ~ school+sex+ address+ famsize+ reason+ nursery+ guardian+ traveltime+ failures+ paid+ freetime+ goout+health+absences+studytime+famrel+activities+romantic, data = alcM2, family = "binomial")

summary(M2)
## 
## Call:
## glm(formula = high_use ~ school + sex + address + famsize + reason + 
##     nursery + guardian + traveltime + failures + paid + freetime + 
##     goout + health + absences + studytime + famrel + activities + 
##     romantic, family = "binomial", data = alcM2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9380  -0.6912  -0.3921   0.5668   2.7954  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.24229    1.12611  -1.991 0.046461 *  
## schoolMS         -0.36846    0.50370  -0.732 0.464470    
## sexM              0.83559    0.30472   2.742 0.006104 ** 
## addressU         -0.86275    0.38198  -2.259 0.023905 *  
## famsizeLE3        0.48723    0.30501   1.597 0.110173    
## reasonhome        0.25059    0.36062   0.695 0.487133    
## reasonother       1.11296    0.49712   2.239 0.025168 *  
## reasonreputation -0.28950    0.37982  -0.762 0.445936    
## nurseryyes       -0.41960    0.33948  -1.236 0.216463    
## guardianmother   -0.72823    0.33009  -2.206 0.027373 *  
## guardianother     0.10564    0.73524   0.144 0.885750    
## traveltime        0.24550    0.21934   1.119 0.263024    
## failures          0.17320    0.23209   0.746 0.455499    
## paidyes           0.69857    0.29234   2.390 0.016870 *  
## freetime          0.17239    0.15346   1.123 0.261274    
## goout             0.84707    0.14158   5.983 2.19e-09 ***
## health            0.13123    0.10449   1.256 0.209140    
## absences          0.09286    0.02425   3.830 0.000128 ***
## studytime        -0.25968    0.19222  -1.351 0.176709    
## famrel           -0.48025    0.16126  -2.978 0.002900 ** 
## activitiesyes    -0.43744    0.28545  -1.532 0.125416    
## romanticyes      -0.50198    0.30813  -1.629 0.103287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 338.38  on 360  degrees of freedom
## AIC: 382.38
## 
## Number of Fisher Scoring iterations: 5
# predict() the probability of high_use
probabilities <- predict(M2, type = "response")

# add the predicted probabilities to 'alc'
alcM2 <- mutate(alcM2, probability = probabilities)

# use the probabilities to make a prediction of high_use
alcM2 <- mutate(alcM2, prediction = probability>0.5)

# call loss_func to compute the average number of wrong predictions
lfM2<-loss_func(class = alcM2$high_use, prob = alcM2$probability)

cv <- cv.glm(data = alcM2, cost = loss_func, glmfit = M2, K = 10)

# average number of wrong predictions in the cross validation
cvM2<-cv$delta[1]

#####################################################################
#10 variables
chooseM3<-c("high_use","famrel","absences","address", "reason","guardian","paid", "goout","sex","famsize")

alcM3<-select(alc,one_of(chooseM3))

#9 explanatory variables
M3<-glm(high_use ~ sex+ address+ famsize+ reason+ guardian+  paid+  goout+absences+famrel, data = alcM3, family = "binomial")

summary(M3)
## 
## Call:
## glm(formula = high_use ~ sex + address + famsize + reason + guardian + 
##     paid + goout + absences + famrel, family = "binomial", data = alcM3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9617  -0.7446  -0.4527   0.6513   2.8447  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.13120    0.78716  -2.707 0.006780 ** 
## sexM              0.99325    0.27671   3.589 0.000331 ***
## addressU         -0.91297    0.33198  -2.750 0.005959 ** 
## famsizeLE3        0.44078    0.29197   1.510 0.131128    
## reasonhome        0.15096    0.34314   0.440 0.659987    
## reasonother       0.89873    0.46825   1.919 0.054940 .  
## reasonreputation -0.53510    0.35986  -1.487 0.137017    
## guardianmother   -0.70149    0.31737  -2.210 0.027085 *  
## guardianother     0.33524    0.69442   0.483 0.629265    
## paidyes           0.52463    0.27638   1.898 0.057669 .  
## goout             0.86127    0.13257   6.497 8.21e-11 ***
## absences          0.09035    0.02330   3.877 0.000106 ***
## famrel           -0.43437    0.14964  -2.903 0.003699 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 354.60  on 369  degrees of freedom
## AIC: 380.6
## 
## Number of Fisher Scoring iterations: 5
# predict() the probability of high_use
probabilities <- predict(M3, type = "response")

# add the predicted probabilities to 'alc'
alcM3 <- mutate(alcM3, probability = probabilities)

# use the probabilities to make a prediction of high_use
alcM3 <- mutate(alcM3, prediction = probability>0.5)

# call loss_func to compute the average number of wrong predictions
lfM3<-loss_func(class = alcM3$high_use, prob = alcM3$probability)

cv <- cv.glm(data = alcM3, cost = loss_func, glmfit = M3, K = 10)

# average number of wrong predictions in the cross validation
cvM3<-cv$delta[1]

#####################################################################
#5 variables
chooseM4<-c("high_use","famrel","absences", "goout","sex")

alcM4<-select(alc,one_of(chooseM4))

#4 explanatory variables
M4<-glm(high_use ~ sex+  goout+absences+famrel, data = alcM4, family = "binomial")

summary(M4)
## 
## Call:
## glm(formula = high_use ~ sex + goout + absences + famrel, family = "binomial", 
##     data = alcM4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7151  -0.7820  -0.5137   0.7537   2.5463  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.76826    0.66170  -4.184 2.87e-05 ***
## sexM         1.01234    0.25895   3.909 9.25e-05 ***
## goout        0.76761    0.12316   6.232 4.59e-10 ***
## absences     0.08168    0.02200   3.713 0.000205 ***
## famrel      -0.39378    0.14035  -2.806 0.005020 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 379.81  on 377  degrees of freedom
## AIC: 389.81
## 
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use
probabilities <- predict(M4, type = "response")

# add the predicted probabilities to 'alc'
alcM4 <- mutate(alcM4, probability = probabilities)

# use the probabilities to make a prediction of high_use
alcM4 <- mutate(alcM4, prediction = probability>0.5)

# call loss_func to compute the average number of wrong predictions
lfM4<-loss_func(class = alcM4$high_use, prob = alcM4$probability)

cv <- cv.glm(data = alcM4, cost = loss_func, glmfit = M4, K = 10)

# average number of wrong predictions in the cross validation
cvM4<-cv$delta[1]

#create a data frame with the number of variables and the training and testing errors
errors <- data.frame(Nvariables = c(5, 10, 19, 30),
                     train = c(lfM4, lfM3, lfM2, lfM1),
                     test  = c(cvM4, cvM3, cvM2, cvM1))

#plot the results
dfplot <- errors %>% gather(key, value, -Nvariables)

ggplot(dfplot, mapping = aes(x = Nvariables, y = value, color = key) ) + geom_line()

It seems that reducing the number of explanatory variables from 30 to 5 has little effect on the training error, whereas it has a clear effect on the testing error: the testing error decreases, so the models with fewer variables give more accurate predictions.

I’m not sure how much I can read into this result, because I dropped the explanatory variables based on their statistical insignificance in models fitted to this same data, which may bias the comparison in favour of the smaller models.


Clustering and classification

This week I learned about clustering and classification: how to cluster observations, how to study which factors affect or justify the clustering, how to choose an appropriate number of clusters, and so on.

2. Data input and summary

This week’s data comes from an R package called MASS:

# access the MASS package
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.4
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The dataset consists of 506 observations of 14 variables, all of them numerical. One variable (‘chas’) is a 0/1 presence/absence dummy variable. The variables describe housing values in the suburbs of Boston, together with factors measured in the suburbs that are thought to be related to housing values: for example the crime rate, access to the Charles River, nitrogen oxides concentration, average number of rooms per dwelling, distances to five Boston employment centres, accessibility to radial highways, proportion of blacks by town and the median value of owner-occupied homes. The full details can be found here.
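The dimensions and types stated above can be verified with a quick sanity check (a small aside, not part of the course template):

```r
# Boston ships with the MASS package (a recommended package in base R)
library(MASS)

dim(Boston)                      # 506 observations, 14 variables
all(sapply(Boston, is.numeric))  # TRUE: every column is numeric
```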

3. Data exploration

Let’s explore graphically the distributions and relations of the data:

pairs(Boston)

This plot is difficult to read; I’ll figure out later how to improve the quality of the output. From the summary table, however, I am able to explore the variation and distributions of the variables.

Here are two dotplots of variables ‘crim’ (per capita crime rate by town) and ‘zn’ (proportion of residential land zoned for lots over 25,000 sq.ft.) which show that these two variables are not very evenly distributed:

dotchart(Boston$crim)

dotchart(Boston$zn)

Some variables seem correlated. Here’s a correlation matrix of the variables:

library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
library(magrittr)

# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>%round(digits=2)

# print the correlation matrix
print(cor_matrix)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

The above matrix is not very readable, as it is printed in two separate blocks. Let’s present the correlations in a nicer way.

# visualize the correlation matrix
corrplot(cor_matrix, method="circle",type="upper",cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

This plot is easier to read. The bigger the circle, the stronger the correlation between the variables; red indicates negative and blue positive correlation.

4. Standardize dataset

Some of the variables have very large values and wide ranges. We scale all the variables, because later on it would be misleading to combine or compare variables that are on different scales. Scaling can be applied to the whole dataset, as all the variables are numerical.

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)

Now all the variables have mean zero and unit variance, so their distributions are on comparable scales.
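What scale() does can be verified by hand on a small vector: it subtracts the column mean and divides by the column standard deviation.

```r
x <- c(2, 4, 6)

# manual centering and scaling: (x - mean) / sd
manual <- (x - mean(x)) / sd(x)   # -1, 0, 1

all.equal(as.numeric(scale(x)), manual)  # TRUE
```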

5. Create a categorical variable

Next I create a categorical variable of the crime rate in the Boston dataset. I use quantiles as the break points. I drop the old crime rate variable from the dataset.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,label=c("low","med_low","med_high","high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
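The same quantile-binning pattern can be checked on toy data: with quantile break points the four bins come out (roughly) equal in size.

```r
x <- 1:100
q <- quantile(x)  # 0%, 25%, 50%, 75%, 100%

# cut into four bins at the quantiles, keeping the minimum in the first bin
f <- cut(x, breaks = q, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(f)  # 25 observations in each bin
```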

6. Divide the data into training and testing sets

For later model evaluation purposes I divide the dataset into training and testing datasets, so that 80% of the data belongs to the train set:

##dividing the data into training and testing sets

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
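The same 80/20 split pattern, shown on a toy data frame with the indexing made explicit (setting a seed is optional but makes the split reproducible):

```r
set.seed(42)
d <- data.frame(x = 1:10)

# sample row indices for the training set (80% of the rows)
idx <- sample(nrow(d), size = floor(nrow(d) * 0.8))

train_toy <- d[idx, , drop = FALSE]   # 8 rows
test_toy  <- d[-idx, , drop = FALSE]  # the remaining 2 rows
```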

7. Fit a linear discriminant analysis on the train set

Next I want to know which variables might explain the target variable crime rate. I do a linear discriminant analysis with the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables:

# linear discriminant analysis
lda.fit <- lda(crime~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2599010 0.2574257 0.2301980 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.98949884 -0.9030454 -0.079333958 -0.8762533  0.4580255
## med_low  -0.08113245 -0.3066922 -0.009855719 -0.5847777 -0.1216352
## med_high -0.39488943  0.2350252  0.295521928  0.3961168  0.0294096
## high     -0.48724019  1.0149946 -0.060657012  0.9962637 -0.3890523
##                 age        dis        rad        tax     ptratio
## low      -0.9105094  0.8782527 -0.6722355 -0.7354632 -0.41691230
## med_low  -0.2838192  0.3585119 -0.5498288 -0.4764756 -0.07008217
## med_high  0.3788274 -0.3736974 -0.3976990 -0.2749146 -0.19220369
## high      0.7885245 -0.8453531  1.6596029  1.5294129  0.80577843
##               black      lstat        medv
## low       0.3741162 -0.7779813  0.52377288
## med_low   0.3083986 -0.1250603 -0.01019437
## med_high  0.1165666  0.0477216  0.12094959
## high     -0.8988923  0.8471605 -0.69043588
## 
## Coefficients of linear discriminants:
##                  LD1           LD2         LD3
## zn       0.083997038  0.7173626033 -0.89323527
## indus    0.033505611 -0.2257427905  0.21042219
## chas    -0.064677615 -0.0789752501  0.03358591
## nox      0.255917654 -0.8223216412 -1.54772191
## rm      -0.101041075  0.0005479474 -0.12551422
## age      0.301129407 -0.2488880179  0.15659154
## dis     -0.106519457 -0.2995733219  0.20894309
## rad      3.190005830  0.9575551654 -0.02215920
## tax     -0.003077552 -0.0628363337  0.57200802
## ptratio  0.133296450 -0.0462622218 -0.36172841
## black   -0.180035400 -0.0310410121  0.07040604
## lstat    0.119194098 -0.2274669768  0.33977727
## medv     0.168574061 -0.4483971838 -0.23260203
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9431 0.0427 0.0142
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 1)

The variable ‘rad’ looks like a strong classifying factor. ‘zn’ and ‘nox’ also separate the observations.

8. Predict

Next I want to use the observations in the test set to predict crime classes. I do this because I want to estimate the “goodness” of my model by comparing predictions to observed “real” data.

For prediction I use the LDA model on the test data. For comparison I tabulate the results with the crime categories from the test set:

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      10        2    0
##   med_low    2      13        6    0
##   med_high   0       5       17    0
##   high       0       0        1   33

I did the random train/test division and the prediction above twice. On the first round I got a fairly poor result, with more than half of the med_high cases predicted incorrectly. On the second round (the results shown here) the predictions look better: some classes are still predicted incorrectly, but most of the predictions are correct.
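The overall accuracy can be read off the cross-tabulation above as the share of the diagonal (entering the printed counts by hand):

```r
# confusion matrix from the printed output: rows = correct, columns = predicted
tab <- matrix(c(13,  2,  0,  0,   # predicted low
                10, 13,  5,  0,   # predicted med_low
                 2,  6, 17,  1,   # predicted med_high
                 0,  0,  0, 33),  # predicted high
              nrow = 4)

sum(diag(tab)) / sum(tab)  # about 0.75 of the predictions are correct
```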

9. K-means clustering

Next I study the Boston data without any ready-made classification and try to cluster the observations into groups; maybe they form clusters according to the suburbs. I run the k-means algorithm on the dataset, investigate the optimal number of clusters and then run the algorithm again.

First I reload the Boston dataset and standardize it. Then I calculate the Euclidean distances between the observations and present a summary of the distances:

#standardize the data set
boston_scaled2 <- scale(Boston)

# class of the boston_scaled object
class(boston_scaled2)
## [1] "matrix"
# change the object to data frame
boston_scaled2<-as.data.frame(boston_scaled2)

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
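As a sanity check on what dist() computes by default (the Euclidean distance), a two-point example:

```r
a <- c(0, 0)
b <- c(3, 4)

sqrt(sum((a - b)^2))           # 5, by Pythagoras
as.numeric(dist(rbind(a, b)))  # 5, the same value from dist()
```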

Next I run the k-means clustering with 3 centers.

# k-means clustering
km <-kmeans(boston_scaled2, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled2[9:14], col = km$cluster)

I zoomed in to various parts of the plot and found that, at least for the variable ‘tax’, the observations are divided so that the black cluster clearly forms a group of its own.

I also explored the clustering with 5 centers. The grouping seemed even more arbitrary.

Now, I’m not sure about the best number of clusters, so I compute the total within-cluster sum of squares (WCSS) and see how it behaves when the number of clusters changes:

library(ggplot2)

# set the seed for reproducibility
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The total WCSS drops sharply when moving to two clusters, which suggests that two is the optimal number of clusters for this dataset.

I run the clustering again with 2 centers:

# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2[1:6], col = km$cluster)

Now the clustering seems better, at least for some variable pairs. But in my opinion, having only two groups doesn’t tell much. Maybe it suggests that the residents of Boston are divided into two groups, the wealthy and the poor?

Bonus:

Next I perform LDA on the Boston dataset again, this time with the three k-means clusters as the target variable. By visualizing the results with a biplot I can interpret which variables influence the clustering.

boston_scaled3<-boston_scaled2

# k-means clustering
km <-kmeans(boston_scaled3, centers = 3)

klusteri<-km$cluster
class(klusteri)
## [1] "integer"
boston_scaled3<-cbind(boston_scaled3,klusteri)
summary(boston_scaled3)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv            klusteri    
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063   Min.   :1.000  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989   1st Qu.:1.000  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449   Median :2.000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   :1.972  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683   3rd Qu.:3.000  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865   Max.   :3.000
# linear discriminant analysis
lda.fit2 <- lda(klusteri~., data = boston_scaled3)

# print the lda.fit object
lda.fit2
## Call:
## lda(klusteri ~ ., data = boston_scaled3)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.3003953 0.4268775 0.2727273 
## 
## Group means:
##         crim         zn      indus        chas        nox         rm
## 1  0.8942488 -0.4872402  1.0913679 -0.01330932  1.1109351 -0.4609873
## 2 -0.3688324 -0.3935457 -0.1369208  0.07398993 -0.1662087 -0.1700456
## 3 -0.4076669  1.1526549 -0.9877755 -0.10115080 -0.9634859  0.7739125
##          age         dis        rad        tax     ptratio      black
## 1  0.7828949 -0.84882600  1.3656860  1.3895093  0.63256391 -0.7083974
## 2  0.1673019 -0.07766431 -0.5799077 -0.5409630 -0.04596655  0.2680397
## 3 -1.1241828  1.05650031 -0.5965522 -0.6837494 -0.62478941  0.3607235
##         lstat        medv
## 1  0.90799414 -0.69550394
## 2 -0.05818052 -0.04811607
## 3 -0.90904433  0.84137443
## 
## Coefficients of linear discriminants:
##                  LD1         LD2
## crim     0.043702606  0.16161136
## zn       0.049248495  0.76920932
## indus   -0.331498698  0.02870425
## chas    -0.012406954 -0.11314905
## nox     -0.721972554  0.40566595
## rm       0.174541989  0.41632858
## age      0.006221178 -0.88117192
## dis      0.043869924  0.36910493
## rad     -1.256861546  0.47665247
## tax     -0.992855786  0.46457291
## ptratio -0.092336951 -0.01003010
## black    0.073915653 -0.03513128
## lstat   -0.372145848  0.38403679
## medv    -0.058153798  0.49571753
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8785 0.1215
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(boston_scaled3$klusteri)

# plot the lda results
plot(lda.fit2, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit2, myscale = 1)
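
As a quick sanity check of my own (not part of the exercise), one could also see how well the model re-classifies the k-means clusters it was trained on. A sketch, relying on the fact that predict() on an lda object without newdata uses the training data:

# cross-tabulate the original clusters against the LDA classification
lda.pred2 <- predict(lda.fit2)
table(correct = boston_scaled3$klusteri, predicted = lda.pred2$class)
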

From these results I would interpret that the variable ‘rad’ (index of accessibility to radial highways) is the strongest linear separator in this dataset: it has the largest absolute coefficient on LD1, which alone accounts for about 88% of the trace. That said, ‘tax’ and ‘nox’ are not far behind.
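
This interpretation is easy to verify directly from the fitted object. A small sketch that ranks the variables by the absolute size of their LD1 coefficient (using the lda.fit2 object from above):

# rank the variables by absolute LD1 coefficient; the biggest values
# correspond to the strongest linear separators along LD1
ld1 <- lda.fit2$scaling[, "LD1"]
sort(abs(ld1), decreasing = TRUE)

Based on the coefficients printed above, ‘rad’ should come out on top, followed by ‘tax’ and ‘nox’.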

Super-Bonus:

Next I’ll draw some 3D plots of the training data. NOTE! You might have to click around to see the figure.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
## Warning: package 'plotly' was built under R version 3.4.4
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# Set the color to be the crime classes of the train set. Draw another 3D plot
# where the color is defined by the clusters of the k-means. How do the plots differ?

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color=train$crime)  

I stop the exercise here because I’m having trouble understanding the instructions. I can draw these two 3D plots, and crime seems to be a strong separator in the dataset. The last plot should demonstrate the division by clusters, but I’m no longer sure whether I should run the k-means clustering again on the training data and then change the plotting code, or whether modifying the color argument alone would be enough. I’ll leave it here.
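
For the record, here is roughly how the cluster colouring could go (a sketch I have not tested against this data): run k-means on the training predictors and pass the resulting cluster vector as the color argument. The choice of 4 centers is my own assumption, picked to match the number of crime classes.

# k-means on the training predictors; centers = 4 is an assumption
set.seed(123)
km_train <- kmeans(model_predictors, centers = 4)

# same 3D plot, but coloured by k-means cluster instead of crime class
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers',
        color = as.factor(km_train$cluster))

Comparing this plot with the crime-coloured one would then show how well the unsupervised clusters line up with the crime classes.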